Quantifying Signal Purity by means of Machine Learning

ABSTRACT

A system includes a memory and a processor. The memory is configured to store a machine learning (ML) model. The processor is configured to (i) obtain a set of training audio signals that are labeled with respective levels of distortion, (ii) convert the training audio signals into respective images, (iii) train the ML model to estimate the levels of the distortion based on the images, (iv) receive an input audio signal, (v) convert the input audio signal into an image, and (vi) estimate a level of the distortion in the input audio signal, by applying the trained ML model to the image.

FIELD OF THE INVENTION

The present invention relates generally to processing of audio signals, and particularly to methods and systems for quantification of audio signal purity.

BACKGROUND OF THE INVENTION

An audio system is typically regarded as “high quality” if the ratio between the added audio artefacts, which are a by-product of the system itself, and the input signal is kept to a minimum. Such artefacts can be divided into noise, non-harmonic distortion and harmonic distortion. Sensing and quantifying such artefacts is needed both for designing better systems and for providing real-time control of automatic-tuning systems.

Techniques for sensing of distortion in audio signals have been previously proposed in the patent literature. For example, U.S. Pat. No. 10,559,316 describes systems and methods that provide distortion sensing, prevention, and/or distortion-aware bass enhancement in audio systems, that can be implemented in a variety of applications. Sensing circuitry can generate statistics based on an input signal received for which an acoustic output is generated. In some embodiments, the sensing circuitry is operable to compute a soft indicator corresponding to a likelihood of distortion or a degree of objectionable, perceptible, or measurable distortion, at an output of the speaker using a technique selected from a group including machine learning, statistical learning, predictive learning, or artificial intelligence.

SUMMARY OF THE INVENTION

An embodiment of the present invention that is described hereinafter provides a system including a memory and a processor. The memory is configured to store a machine learning (ML) model. The processor is configured to (i) obtain a set of training audio signals that are labeled with respective levels of distortion, (ii) convert the training audio signals into respective images, (iii) train the ML model to estimate the levels of the distortion based on the images, (iv) receive an input audio signal, (v) convert the input audio signal into an image, and (vi) estimate a level of the distortion in the input audio signal, by applying the trained ML model to the image.

In some embodiments, the distortion includes a Total Harmonic Distortion (THD).

In some embodiments, the processor is configured to convert a given training audio signal into a given image by setting pixel values of the given image to represent an amplitude of the given training audio signal as a function of time.

In some embodiments, the respective images and the image are two-dimensional (2D).

In some embodiments, the respective images and the image are of three or more dimensions.

In an embodiment, the processor is configured to obtain the training audio signals by (i) receiving initial audio signals having first durations, and (ii) slicing the initial audio signals into slices having second, shorter durations, so as to produce the training audio signals.

In some embodiments, the ML model includes a convolutional neural network (CNN).

In some embodiments, the ML model includes a generative adversarial network (GAN).

In an embodiment, the input audio signal is received from nonlinear audio processing circuitry.

In an embodiment, the ML model classifies the distortion according to the levels of distortion that label the training audio signals.

In another embodiment, the ML model estimates the level of distortion using regression.

In some embodiments, the processor is further configured to control, using the estimated level of the distortion, an audio system that produces the input audio signal.

There is additionally provided, in accordance with another embodiment of the present invention, a system including a memory and a processor. The memory is configured to store a machine learning (ML) model. The processor is configured to (i) obtain a plurality of initial audio signals, which have first durations in a first range of durations and which are labeled with respective levels of distortion, (ii) slice the initial audio signals into slices having second durations in a second range of durations, shorter than the first durations, so as to produce a set of training audio signals, (iii) train the ML model to estimate the levels of the distortion based on the training audio signals, (iv) receive an input audio signal having a duration in the second range of durations, and (v) estimate a level of the distortion in the input audio signal by applying the trained ML model to the input audio signal.

In some embodiments, the processor is configured to train the ML model by (i) converting the training audio signals into respective images and (ii) training the ML model to estimate the levels of the distortion based on the images.

In some embodiments, the processor is configured to estimate the level of the distortion in the input audio signal by (i) converting the input audio signal into an image and (ii) applying the trained ML model to the image.

In some embodiments, the respective images are two-dimensional (2D) images.

In some embodiments, the respective images are of three or more dimensions.

There is further provided, in accordance with another embodiment of the present invention, a method including obtaining a set of training audio signals that are labeled with respective levels of distortion. The training audio signals are converted into respective two-dimensional (2D) images. A machine learning (ML) model is trained to estimate the levels of the distortion based on the 2D images. An input audio signal is received. The input audio signal is converted into a 2D image. A level of the distortion in the input audio signal is estimated by applying the trained ML model to the 2D image.

There is furthermore provided, in accordance with another embodiment of the present invention, a method including obtaining a plurality of initial audio signals, which have first durations in a first range of durations and which are labeled with respective levels of distortion. The initial audio signals are sliced into slices having second durations in a second range of durations, shorter than the first durations, so as to produce a set of training audio signals. An ML model is trained to estimate the levels of the distortion based on the training audio signals. An input audio signal is received, having a duration in the second range of durations. A level of the distortion in the input audio signal is estimated by applying the trained ML model to the input audio signal.

The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a graph showing the effects of audio compression by a Dynamic Range Compressor (DRC) configured with short and long response times on an audio signal, in accordance with an embodiment of the present invention;

FIG. 2 is a block diagram schematically illustrating a system for estimation of virtual total harmonic distortion (vTHD) of a short audio sample output by an audio processing apparatus, in accordance with an embodiment of the present invention;

FIG. 3 shows a set of two-dimensional (2D) images used in training an artificial neural network (ANN) in the system of FIG. 2, in accordance with an embodiment of the present invention;

FIG. 4 illustrates a confusion matrix comparing vTHD estimated using the system of FIG. 2 to a ground-truth THD of FIG. 3, in accordance with an embodiment of the present invention; and

FIG. 5 is a flow chart that schematically illustrates a method for estimation of vTHD of a short audio sample using the system of FIG. 2, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS

Overview

Audio (e.g., music or voice) is primarily a form of acoustic energy spread over a continuous or discrete range of frequencies. One technique to characterize the audio quality of an audio device is to measure the Total Harmonic Distortion (THD) that the device introduces into an input audio signal. THD is a measure of the harmonic distortion present in a signal, and is defined as the ratio of the sum of the powers of all harmonic components to the power of a fundamental frequency, the fundamental frequency being a sinewave.

When the main performance criterion is the “purity” of the original sinewave (in other words, the contribution of the original frequency with respect to its harmonics), the measurement is most commonly defined as the ratio of the RMS amplitude, A, of a set of higher harmonic frequencies to the RMS amplitude of the first harmonic, or fundamental, frequency:

$\mathrm{THD} = \frac{\sqrt{A_{2\omega}^{2} + A_{3\omega}^{2} + A_{4\omega}^{2} + A_{5\omega}^{2} + \ldots}}{A_{\omega}}$
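The amplitude-ratio definition above can be sketched numerically. The following is a minimal illustration, not the measurement procedure of any particular analyzer: the function name `thd_from_fft`, the 48 kHz sample rate, the test tone, and the choice of five harmonics are all assumptions made for the example.

```python
import numpy as np

def thd_from_fft(signal, sample_rate, fundamental_hz, n_harmonics=5):
    """Ratio of the root-sum-square of harmonic amplitudes (2w, 3w, ...) to the fundamental."""
    spectrum = np.abs(np.fft.rfft(signal)) / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

    def amplitude_at(freq_hz):
        # Use the FFT bin closest to the requested frequency.
        return spectrum[np.argmin(np.abs(freqs - freq_hz))]

    fundamental = amplitude_at(fundamental_hz)
    harmonics = [amplitude_at(k * fundamental_hz) for k in range(2, n_harmonics + 2)]
    return np.sqrt(np.sum(np.square(harmonics))) / fundamental

# Assumed test signal: one second of a 1 kHz tone plus a third harmonic at 10% amplitude.
sample_rate = 48_000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)
thd = thd_from_fft(tone, sample_rate, 1000.0)  # close to 0.1 for this signal
```

The one-second window holds a whole number of cycles of every component, so each harmonic falls exactly on an FFT bin. Shorter or non-periodic windows would smear the harmonics across bins.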

In audio systems, a lower THD (i.e., lower distortion) means that audio components such as a loudspeaker, an amplifier, a signal processing unit, a microphone or other audio equipment, produce a more accurate reproduction of the original input audio.

The distortion of a waveform relative to a pure sinewave, for example, can be measured either by using a THD analyzer to analyze the output wave into its constituent harmonics and noting the amplitude of each harmonic relative to the fundamental, or by cancelling out the fundamental with a notch filter and measuring the remaining signal, which will be a total aggregate harmonic distortion plus noise.

Given a sine wave generator of very low inherent distortion, the generator's output can be used as an input to amplification equipment, whose distortion at different frequencies and signal levels can be measured by examining the output waveform. While dedicated electronic equipment can be used to both generate sinewaves and to measure distortion, a general-purpose digital computer equipped with a sound card and suitable software can carry out harmonic analysis.

Identifying various different frequencies from an incoming time-domain signal is typically done using a Fourier transform, which is based on mathematical integration. This process requires a signal with a minimal time duration to achieve a specific spectral resolution required of the measurement. Therefore, THD can only be well defined for a sufficient number of cycles of an incoming time-domain signal. For example, to measure a low frequency sine wave (e.g., a bass monotone at 100 Hz and corresponding cycle of 10 mSec), the incoming time-domain signal must be stable over at least several hundred milliseconds (i.e., at least over several tens of cycles).

This means that THD cannot be estimated for an “instantaneous” audio signal, such as an audio performance during a sound-dominant portion of a beat of a drum that, typically, lasts a few tens of milliseconds at most. The human ear, on the other hand, can recognize distortion of such a drum beat.
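The duration constraint can be made concrete with a short calculation. The sketch below assumes the common rule of thumb that a window of duration T resolves frequencies spaced roughly 1/T apart; the helper names and the three window lengths are illustrative choices, not values prescribed by the text.

```python
def frequency_resolution_hz(duration_s):
    """Approximate spacing between resolvable frequencies for a window of this length."""
    return 1.0 / duration_s

def cycles_in_window(frequency_hz, duration_s):
    """Number of cycles of a tone that fit inside the window."""
    return frequency_hz * duration_s

# Compare a long window, a drum-beat-sized window, and a few-millisecond window
# for the 100 Hz bass tone mentioned in the text.
for duration_ms in (300.0, 30.0, 5.0):
    duration_s = duration_ms / 1000.0
    print(
        f"{duration_ms:6.1f} ms window: "
        f"resolution ~{frequency_resolution_hz(duration_s):6.1f} Hz, "
        f"{cycles_in_window(100.0, duration_s):4.1f} cycles of 100 Hz"
    )
```

At 5 ms the resolution is coarser than the 100 Hz spacing between adjacent harmonics, so the harmonics of the bass tone cannot be separated from the fundamental.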

In particular, the absence of a THD measurement precludes (a) using the measure to design a more linear system (when the distortion is unintentional), and (b) using the measure, including in real time, to control (e.g., limit) an amount of intentional distortion, such as that introduced by a non-linear audio processing element.

Embodiments of the present invention that are described herein provide systems and methods that define and estimate a level of the distortion in an audio signal, by applying a machine learning (ML) model (e.g., an artificial neural network (ANN)) and artificial intelligence (AI) techniques, such as using a trained ML model. Some embodiments define and estimate a harmonic distortion by defining and estimating “virtual THD” (vTHD), which can be described as a measure of an instantaneous THD. For audio signals for which THD is well defined, vTHD coincides with THD up to a given tolerance (e.g., allowing a classification error to a nearest labeled THD value, such as one smaller or one larger than the classified THD value). However, when THD fails for very short duration audio signals, vTHD provides a new standard for estimating audio quality based on the disclosed technique that estimates vTHD of such signals.

Some embodiments of the disclosed solution are focused on sensing and quantifying harmonic distortions, regardless of noise, in a very short time. This feature makes the disclosed techniques applicable to dynamic (i.e., rapidly varying) signals and provides a powerful tool for better system engineering.

The disclosed ML techniques are able to systematically quantify the so-called “instantaneous” THD (i.e., the entity vTHD) on complex signals (e.g., a drum beat) and at very short times (e.g., several milliseconds).

To illustrate the challenge and the capabilities of such an ML technique, one can consider, by way of example, a Dynamic Range Compressor (DRC) nonlinear audio device that maps an input dynamic range to a smaller dynamic range on the output side. This sort of compression is usually achieved by lowering the high energy parts of the signal.

There is a strong relation between the response times of a DRC and the amount of harmonic distortion it will create as a side effect. As a general example, a very fast response time (e.g., 1 mSec) setting on a very slow signal (e.g., 100 Hz) will create distortions once the compressor significantly attenuates the output. A DRC might have different response-time operation profiles from which to select. So, with the disclosed technique, a designer and system architect of such a device can quantify, using a vTHD scale, the distortion level of one DRC design over another.

The disclosed technique is by no means restricted to DRCs. A DRC embodiment is described later in detail since DRCs are a very common tool and since a DRC's distortion artefact is controllable, making this use-case a good tool for explaining the technique.

In some embodiments, the disclosed technique endeavors to detect audio distortion in an audio signal that is presented as a picture (e.g., converted into 2D information). To this end, the disclosed technique classifies a set of distortions according to a model trained by using signals that were sliced from longer signals having a measurable THD. In particular, the THD of the longer signals can be measured by a laboratory-grade analyzer. The technique trains an ML model with a set of short (e.g., sliced) signals to classify any short signal according to the sets of labels, where the label is now converted one-to-one from THD to vTHD, with the vTHD of a distortion determined only by inference.

One scenario that justifies this conjecture on conversion validity is to consider a long stable signal (e.g., lasting a few hundred cycles) from which THD can be measured. By slicing only several cycles of the long signal, a very short signal is received, on which THD is undefined, but any distortion is still present, and therefore a valid definition of the vTHD scale would follow the rule:

vTHD(sliced_signal):=THD(long_signal)
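The labeling rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `slice_with_inherited_label`, the use of non-overlapping slices, the 48 kHz sample rate, and the placeholder THD value of 12.0 are all hypothetical and serve only to show each slice inheriting the THD measured on the long signal.

```python
import numpy as np

def slice_with_inherited_label(long_signal, measured_thd, slice_len):
    """Cut a long signal into non-overlapping slices; every slice inherits
    the THD measured on the long signal as its vTHD label."""
    n_slices = len(long_signal) // slice_len
    slices = long_signal[: n_slices * slice_len].reshape(n_slices, slice_len)
    labels = np.full(n_slices, measured_thd)
    return slices, labels

# Assumed long signal: 48 cycles of a clipped 1 kHz tone, 48 samples per cycle.
sample_rate = 48_000
t = np.arange(48 * 48) / sample_rate
long_signal = np.clip(1.5 * np.sin(2 * np.pi * 1000 * t), -1.0, 1.0)

# 12.0 is a placeholder for a THD value that would come from an analyzer.
slices, vthd_labels = slice_with_inherited_label(
    long_signal, measured_thd=12.0, slice_len=5 * 48  # five cycles per slice
)
```

Samples left over after the last full slice are dropped in this sketch. Overlapping slices or a single truncated slice per signal would be equally compatible with the rule.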

In one embodiment, a system is provided that includes a memory configured to store a machine learning (ML) model and a processor, which is configured to perform the following steps:

(i) obtain a plurality of initial audio signals, which have first durations in a first range of durations and which are labeled with respective levels of distortion. In the context of embodiments of this invention, “obtain” means “receive from the outside and/or produce internally.”

(ii) preprocess the initial audio signals by slicing them into slices having second durations in a second range of durations, shorter than the first durations, so as to produce a set of training audio signals.

(iii) train the ML model to estimate the levels of the distortion based on the training audio signals. For example, train the ML model to estimate the vTHD of the preprocessed audio signals.

(iv) receive an input audio signal having a duration in the second range of durations.

(v) estimate a level of the distortion (e.g., level of vTHD) in the input audio signal by applying the trained ML model to the input audio signal.

In a particular embodiment, the processor is configured to train the ML model by (i) converting the training audio signals into respective images (e.g., two-dimensional (2D) images) and (ii) training the ML model to estimate the levels of the distortion based on the images. The processor is configured to estimate the level of the distortion in the input audio signal (e.g., its vTHD) by (i) converting the input audio signal into a 2D image and (ii) applying the trained ML model to the 2D image. Note, however, that the disclosed technique can convert audio signals into multi-dimensional mathematical structures (e.g., 3D and more), such as tensors, to, for example, utilize dedicated computing hardware such as graphics processing units (GPUs) or tensor processing units (TPUs). Moreover, given a type of ML model (e.g., a type of NN) which is optimized for another mathematical structure at its input, the disclosed technique can, mutatis mutandis, convert an audio signal to that structure, such as a 3D RGB image, and apply the given type of trained ML model to it.

The training audio signals are typically labeled according to a ground truth scale of the THD, to, for example, estimate and classify the new preprocessed audio signal, during inference, according to the different labels of THD. The processor runs the ML model to infer the new preprocessed audio signal and to classify the new audio signal according to the different labels of THD with the respective vTHD. However, as no actual THD measurement could have been performed, the ML model is trained to recognize a distortion pattern on brief signals. In this way, as noted above, the vTHD serves as a consistent scale for comparing audio processing performance of very short duration signals.

In one embodiment, the processor is configured to preprocess the training audio signals by converting each audio signal into a respective 2D image. For example, the processor is configured to convert each audio signal into a respective black and white 2D image by binary coding the audio signals in a 2D plane comprising a temporal axis and a signal amplitude axis, which is manifested as encoding an area confined by the graph as black while encoding the rest of the 2D image as white, as described below.

In another embodiment, the training samples are sliced and used in this way as a 2D image input for training without further preprocessing (e.g., without the black and white area encoding), and a new signal is not preprocessed before the ML model runs inference on that audio signal.

In yet another embodiment, the ML model uses an ANN as a generative adversarial network (GAN), which is particularly flexible in learning and inferring arbitrary waveforms. In general, various ML models may be used with a data format optimized (e.g., converted from the audio samples) for the given ML model.

Moreover, with the necessary changes being made, the disclosed technique can identify and estimate audio distortions other than harmonic ones. For example, the disclosed technique may be applied, mutatis mutandis, to identify and estimate one of phase noise, chirp, and damping in audio signals.

By providing an ML-based audio distortion scale called virtual THD, the disclosed technique enables audio engineers to quantify audio performance that cannot be quantified using existing techniques.

DRC-Induced Audio Distortion Over Short Time Durations

The time duration needed for a DRC to respond to (i.e., compress) an increased input signal (“attack”), or for a DRC to stop its processing (“release”), is a crucial parameter for audio quality. A user cannot simply “set the attack and release” to a minimum, because an exceedingly short attack and release setting creates harmonic distortion. This artefact, e.g., THD, is a by-product of the DRC setting in conjunction with the input signal and its properties.

The THD of an output signal (i.e., a THD which is a by-product of the DRC setting) is easily noticeable by a human listener and hence each DRC has its attack and release knobs (or auto setting). Moreover, THD is viewable on a waveform display.

Although this distortion is both audible and viewable to a human user, it is quite surprising to see that there is no measurement method which quantifies it. This lack of quantification leads to a reality in which DRC engineers and system designers lack a scientific measurement tool which can help systemize the development process of future DRCs by means of quantifying the artefacts. As mentioned above, this is true not only for DRCs, but in fact for any non-linear processor (gates, limiters, saturators, etc.).

FIG. 1 is a graph 10 that shows the effect of audio compression on an audio signal, the compression performed by a Dynamic Range Compressor (DRC) configured with short and long response times, in accordance with an embodiment of the present invention.

In the shown embodiment, a compressor or a DRC maps an input dynamic range 13 of an incoming sinewave signal into a target dynamic range 15, set by the user. This process involves setting (or auto setting) a threshold audio energy, above which the DRC will compress and under which the DRC will not alter the signal, the ratio of compression, as well as the attack and release.

In the example of FIG. 1, the input signal has a fixed frequency of 1 kHz with an amplitude that can be varied below and above the threshold value of the DRC. In the example measurement of FIG. 1, the DRC threshold is −15 dB, with a compression ratio of 1:99. With two different attack times (10 µSec vs. 2 mSec), the difference in the resulting output distortion is visually very vivid. As seen, the short attack time results in a signal 22 that is highly distorted. On the other hand, a signal 12, which results from the long attack time, is largely a sinewave, with some amplitude modulation.

However, the different levels of distortion exhibited by signals 22 and 12 are not quantifiable to date, as explained above. The present disclosure provides embodiments that can quantify the different short-duration audio distortions (e.g., distortions taking place over a time duration smaller than several milliseconds).
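The dependence of distortion on response time can be reproduced with a toy model. The sketch below is not the DRC of FIG. 1: it assumes a simple feed-forward compressor whose gain, in dB, is smoothed by a one-pole filter, with a single time constant used for both attack and release. The function names, the 20 mSec slow setting, and the reading of the compression ratio as 99 are assumptions made for illustration.

```python
import numpy as np

def compress(signal, sample_rate, response_s, threshold_db=-15.0, ratio=99.0):
    """Toy feed-forward compressor; one time constant acts as attack and release."""
    coeff = np.exp(-1.0 / (response_s * sample_rate))
    gain_db = 0.0
    out = np.empty_like(signal)
    for n, x in enumerate(signal):
        level_db = 20.0 * np.log10(max(abs(x), 1e-9))
        over_db = max(level_db - threshold_db, 0.0)
        target_db = -over_db * (1.0 - 1.0 / ratio)              # static compression curve
        gain_db = coeff * gain_db + (1.0 - coeff) * target_db   # one-pole smoothing
        out[n] = x * 10.0 ** (gain_db / 20.0)
    return out

def harmonic_ratio(signal, sample_rate, fundamental_hz, n_harmonics=5):
    """Root-sum-square of harmonics 2..n+1 relative to the fundamental."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    amp = lambda f: spectrum[np.argmin(np.abs(freqs - f))]
    harmonics = [amp(k * fundamental_hz) for k in range(2, n_harmonics + 2)]
    return np.sqrt(np.sum(np.square(harmonics))) / amp(fundamental_hz)

sample_rate = 48_000
t = np.arange(sample_rate // 2) / sample_rate   # 0.5 s of a full-scale 1 kHz tone
tone = np.sin(2 * np.pi * 1000 * t)
tail = slice(-4800, None)                        # last 100 ms, after the gain has settled

thd_fast = harmonic_ratio(compress(tone, sample_rate, 10e-6)[tail], sample_rate, 1000.0)
thd_slow = harmonic_ratio(compress(tone, sample_rate, 20e-3)[tail], sample_rate, 1000.0)
```

With the 10 µSec setting the gain follows the waveform within each cycle, so the output is flattened near the threshold and `thd_fast` is large. With the 20 mSec setting the gain is nearly constant over a cycle and `thd_slow` stays small. Both figures are measured on a long steady tone, which is the case where conventional THD is still defined.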

SYSTEM DESCRIPTION

FIG. 2 is a block diagram schematically illustrating a system 201 for estimation of virtual total harmonic distortion (vTHD) of a short audio sample (121) outputted by an audio processing apparatus 101, in accordance with an embodiment of the present invention.

As seen, system 201 is coupled to audio processing apparatus 101 that comprises a linear gain circuitry 103 that does not distort the input signal, and a non-linear processor 105, such as the aforementioned DRC, that may distort the linearly amplified input signal. The output signal is directed to an output device 107, such as a loudspeaker.

System 201 for estimation of vTHD is configured to estimate the nonlinear audio effect of audio processing apparatus 101, and in particular of non-linear processor 105, by providing a vTHD 210 grade of an unintentional distortion introduced by non-linear processor 105. Using the estimated vTHD enables a user, or a processor, to optimize settings of apparatus 101 to optimize an intentional amount of distortion, such as to limit an intentional distortion to a desired level.

As further seen, system 201 is inputted with an audio signal 121 that is distorted after being processed by non-linear audio processing circuitry 105.

A processor 208, or a preprocessing circuitry 206, performs preprocessing of audio signal 121 by converting (e.g., encoding) the 1D waveform of signal 121 into a 2D black and white image 211, such as the images seen in FIG. 3. In other words, processor 208 converts a given training audio signal into a given 2D image by setting pixel values of the 2D image to represent an amplitude of the given training audio signal as a function of time.

Then, processor 208 runs a trained ANN 207 (which can be a convolutional ANN (CNN) or a GAN, to name two options) that is held in a memory 209 to perform inference on image 211 to estimate vTHD 210 of signal 121.

Finally, a feedback line 283 between processor 208 and non-linear processor 105 enables controlling the amount of artefacts in output audio signal 121, based on the estimated vTHD. Such a feedback line may alternatively, or additionally, be used between processor 208 and linear gain circuitry 103.

The embodiment of FIG. 2 is depicted by way of example, purely for the sake of clarity. For example, preprocessing circuitry 206 may perform another type of preprocessing, or, for a given suitable ML model being used, perform no preprocessing of the training samples 121 (e.g., aside from slicing them after measuring THD).

The different elements of system 201 and audio processing apparatus 101 shown in FIG. 2 may be implemented using suitable hardware, such as one or more discrete components, one or more Application-Specific Integrated Circuits (ASICs) and/or one or more Field-Programmable Gate Arrays (FPGAs). Some of the functions of system 201 may be implemented in one or more general purpose processors programmed in software to carry out the functions described herein.

The software may be downloaded to the processors in electronic form, over a network or from a host, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.

Preprocessing of Audio Signals for Subsequent Determination of vTHD Using an ANN

FIG. 3 shows a set 202 of two-dimensional (2D) images used in training an artificial neural network (ANN) 207 in the system of FIG. 2, in accordance with an embodiment of the present invention. As seen, images of set 202 are associated with progressively increasing THD levels. The THD was measured on training audio signals from which the 2D images were generated, the training audio signals being each of 48 cycle length (i.e., samples with duration of 48 mSec at 1 kHz). The preprocessed 2D images were generated after the training audio samples were truncated (e.g., sliced) to leave only five cycles. Thus, the training uses short duration samples (e.g., of five cycles of a 1 kHz wave), with total duration of each sample being 5 milliseconds. This duration is considered very short and does not allow, for example, meaningful FFT analysis of harmonic distortion, as emphasized above. In principle, a signal can be truncated to as little as a fraction of a cycle (e.g., quarter cycle), and the disclosed technique will generate a vTHD scale of distortion using such ultrashort audio signals. Using truncated signals further allows, for example, maximizing tolerance of the disclosed technique to low signal-to-noise ratio, while gaining on the analysis of ultra-short duration audio harmonic distortions.

Set 202 of training images is a cascade of preprocessed sine-wave signals, the initial sine wave signals having an increasing “digital saturation” level that clips the sine wave at its minimum and maximum absolute values. As seen, there is initially no clipping, i.e., the set starts with zero clipping having a THD=0, with the saturation effect increasing all the way to a maximal clipping that results in a rectangular wave-like waveform with a measured (e.g., ground truth) THD of 28. In the given example, the actual testing starts, for simplicity of presentation, from 4% THD (i.e., THD=4), as described in FIG. 4.

The increased level of THD reflects a growing contribution to the signal of higher harmonics (3ω, 5ω, 7ω . . . ) relative to the pure sinusoidal fundamental at ω.
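The link between clipping depth and odd-harmonic content can be checked numerically. The sketch below is illustrative only: the clip levels, the 48 kHz sample rate, the 100-cycle window, and the seven-harmonic limit are assumptions, and the resulting percentages are not the ground-truth THD values of FIG. 3.

```python
import numpy as np

sample_rate = 48_000
t = np.arange(4800) / sample_rate            # 100 cycles of a 1 kHz tone
sine = np.sin(2 * np.pi * 1000 * t)

def harmonic_amplitudes(signal, n_harmonics=7):
    """Amplitudes at 1, 2, ... kHz; with this window, bin 100*k sits at k kHz."""
    spectrum = np.abs(np.fft.rfft(signal)) * 2 / len(signal)
    return np.array([spectrum[100 * k] for k in range(1, n_harmonics + 1)])

def thd_percent(signal):
    """THD in percent, using harmonics 2 through 7."""
    amps = harmonic_amplitudes(signal)
    return 100 * np.sqrt(np.sum(amps[1:] ** 2)) / amps[0]

# "Digital saturation": clamp the sine symmetrically at progressively lower levels.
thd_by_clip = {}
for clip_level in (1.0, 0.8, 0.5, 0.2):
    saturated = np.clip(sine, -clip_level, clip_level)
    thd_by_clip[clip_level] = thd_percent(saturated)

# Harmonic amplitudes for one clipped case: index 1 is 2 kHz, index 2 is 3 kHz.
amps_half = harmonic_amplitudes(np.clip(sine, -0.5, 0.5))
```

Because the clipping is symmetric, the even harmonics in `amps_half` stay at numerical zero while the 3 kHz, 5 kHz and 7 kHz terms grow as the clip level drops, which is why `thd_by_clip` increases from the unclipped case toward the near-rectangular one.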

Each 2D image of set 202 is derived from a 1D waveform similarly to how image 211 is derived from respective waveform 121, as described in FIG. 2.

In particular, the preprocessing may use a code that blacks out areas 212 between the envelope and the horizontal axis, and keeps the rest of each image white.

In the particular example exemplified by FIG. 3, data preprocessing includes these steps:

1. Data digitization (8-bit): Each waveform i out of N waveforms of a set like set 202 is sampled by a sequence in time {S_(j)} with j being the temporal index.

Data normalization: All data sample values are normalized to −1 to 1.

2. Data transformation: In order to use a convolutional NN (CNN) architecture of an ANN, the data is transformed from 1D data (sequence data, audio signal) to 2D data:

2.1. Every sine wave sample array is transformed into a matrix (represented as a greyscale picture).

2.2. All matrix cells are initialized as white. Each row i represents the amplitude of the sine wave (with a given precision). Each column represents the time j of sampling.

2.3. Filling the matrix: The amplitude of the wave samples i=1, 2, . . . N is transformed using the equation Matrix[(1−Amplitude[S_(ij)])*100] [S_(ij)]=0 (black color).

By applying this step, all the areas between the Si amplitude and the zero-amplitude row were filled in white as well. This was done to add more data inside every sample. This method maximizes the contrast of the signals for better image processing.
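The transformation steps above can be sketched as follows. This is one possible reading, not the exact code behind FIG. 3: the function name `waveform_to_image`, the 201-row height implied by (1 − amplitude) × 100, and the pixel values 255 and 0 are assumptions. The fill color is also an assumption: the sketch paints the area between each sample and the zero-amplitude row black, following the earlier description of the area confined by the graph, and the `fill_area` flag turns that filling off.

```python
import numpy as np

WHITE, BLACK = 255, 0

def waveform_to_image(samples, fill_area=True):
    """Encode a 1D waveform normalized to [-1, 1] as a 2D greyscale matrix.
    Rows index amplitude (row 0 is +1, row 200 is -1); columns index time."""
    samples = np.clip(np.asarray(samples, dtype=float), -1.0, 1.0)
    n_rows, zero_row = 201, 100
    image = np.full((n_rows, len(samples)), WHITE, dtype=np.uint8)
    rows = np.rint((1.0 - samples) * 100).astype(int)
    for col, row in enumerate(rows):
        image[row, col] = BLACK                   # mark the waveform sample itself
        if fill_area:
            lo, hi = sorted((row, zero_row))
            image[lo:hi + 1, col] = BLACK         # fill between sample and zero row
    return image

# Assumed input: five cycles of a 1 kHz sine sampled at 48 kHz (240 samples, 5 ms).
t = np.arange(5 * 48) / 48_000
image = waveform_to_image(np.sin(2 * np.pi * 1000 * t))
```

The resulting 201 × 240 matrix can be passed to an image model directly or saved as a picture for inspection.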

Analysis of Performance of ANN in Classifying vTHD

In the field of ML, and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning algorithm (i.e., one that uses labeled training data for learning). Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa. The name stems from the fact that it makes it easy to see whether the system is confusing two classes (i.e., commonly mislabeling one as the other).

FIG. 4 illustrates a confusion matrix 302 comparing vTHD 210 estimated using system 201 of FIG. 2 to ground-truth THD of FIG. 3, in accordance with an embodiment of the present invention. The number of samples inferenced at each THD level is indicated by a scale 304, with the number of samples ranging from a few to more than 20.

As seen, for THD>4, the errors made during inference by the trained ANN model 207 are deviations by one class at most (for example, some audio samples with THD=j may have been classified as having vTHD=j+1 or vTHD=j−1). The vast majority of audio samples were accurately classified by system 201.
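A confusion matrix and the "one class at most" tolerance can be computed as in the sketch below. The class indices and predictions are hypothetical values invented for illustration; they are not the data of FIG. 4. The helper names are likewise assumptions.

```python
import numpy as np

def confusion_matrix(true_classes, predicted_classes, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(true_classes, predicted_classes):
        matrix[actual, predicted] += 1
    return matrix

def within_one_class(matrix):
    """Fraction of samples on the main diagonal or on its two neighboring diagonals."""
    idx = np.arange(matrix.shape[0])
    near = np.abs(idx[:, None] - idx[None, :]) <= 1
    return matrix[near].sum() / matrix.sum()

# Hypothetical class indices, for illustration only.
true_thd = [0, 0, 1, 1, 2, 2, 3, 3]
pred_vthd = [0, 0, 1, 2, 2, 2, 3, 2]

cm = confusion_matrix(true_thd, pred_vthd, n_classes=4)
exact_accuracy = np.trace(cm) / cm.sum()     # 0.75 for these labels
tolerant_accuracy = within_one_class(cm)     # 1.0 for these labels
```

Reporting both figures separates exact matches from predictions that land on an adjacent THD label, which is the tolerance the text allows when comparing vTHD with THD.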

The example of FIG. 4 is shown by way of example only. As another example, rather than use classification to estimate an error in vTHD compared to a ground truth THD, an ML model may use a regression-based scoring, as described below.

Method of Estimating of vTHD of a Short Audio Sample

FIG. 5 is a flow chart that schematically illustrates a method forestimation of vTHD of a short audio sample using system 201 of FIG. 3 ,in accordance with an embodiment of the present invention. Thealgorithm, according to the presented embodiment, carries out a processthat is split between a training phase 401 and an inferencing phase 403.

The training phase begins at an uploading step 402, during whichprocessor 208 uploads a set of short (e.g., sliced) training audiosamples, like the 5-cycle audio sample used in FIG. 3 , from memory 209.Next, processing circuitry 206 converts the audio samples into black andwhite images, as shown in FIG. 3 , at a data format conversion step 404.

In an ANN training step 406, processor 208 trains ANN 209 using theblack and white images to estimate a vTHD of an audio signal.

Inference phase 403 begins with system 201 receiving as an input a short-duration audio sample (e.g., of several milliseconds duration), at an audio sample inputting step 408.

Next, processing circuitry 206 converts the short audio sample into a black and white image, at a data format conversion step 410. Then, processor 208 runs the trained ANN 207 to estimate a vTHD value of the audio sample, at a vTHD estimation step 412. Finally, at a vTHD outputting step 414, processor 208 of system 201 outputs the estimated vTHD to a user, or to a processor, to, for example, adjust a nonlinear audio stage according to a desired vTHD value, such as to adjust a saturation level imposed by nonlinear audio processor 105 of audio processing apparatus 101.
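
The use of an estimated distortion level to adjust a nonlinear stage can be illustrated by the following Python sketch, which is an assumption-laden example and not the described implementation. The function `thd_percent` is a stand-in for the trained model: it computes THD directly from a one-cycle discrete Fourier transform, using the conventional ratio of the root-sum-square of the harmonics to the fundamental, truncated here at the ninth harmonic. The tanh saturation stage, the bisection search and all names and parameter values are hypothetical.

```python
import cmath
import math


def harmonic_amplitude(samples, k):
    """Amplitude of the k-th harmonic of a signal spanning exactly one cycle."""
    n = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
              for i, x in enumerate(samples))
    return 2.0 * abs(acc) / n


def thd_percent(samples, max_harmonic=9):
    """Stand-in for the trained model: THD in percent from a one-cycle DFT."""
    fundamental = harmonic_amplitude(samples, 1)
    harmonics = math.sqrt(sum(harmonic_amplitude(samples, k) ** 2
                              for k in range(2, max_harmonic + 1)))
    return 100.0 * harmonics / fundamental


def saturate(samples, drive):
    """Hypothetical nonlinear stage: tanh soft clipping with gain `drive`."""
    return [math.tanh(drive * x) for x in samples]


def one_cycle_sine(n=256):
    return [math.sin(2 * math.pi * i / n) for i in range(n)]


def tune_drive(target_thd, low=0.01, high=10.0, steps=40):
    """Bisect on `drive` until the estimated THD matches the target."""
    sine = one_cycle_sine()
    for _ in range(steps):
        mid = 0.5 * (low + high)
        if thd_percent(saturate(sine, mid)) < target_thd:
            low = mid   # too clean: increase the saturation
        else:
            high = mid  # too distorted: decrease the saturation
    return 0.5 * (low + high)


drive = tune_drive(target_thd=5.0)
achieved = thd_percent(saturate(one_cycle_sine(), drive))
print(round(drive, 3), round(achieved, 3))
```

In a system of the kind described, the estimate would come from the trained model applied to a short audio sample rather than from a Fourier analysis of a known test tone.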

The flow chart of FIG. 5 is brought purely by way of example, for the sake of clarity. For example, other preprocessing steps, or fewer steps, may be used.

Regression-Based vTHD Estimation

As noted above, regression-based scoring may be used in addition to, or as an alternative to, the vTHD estimation by classification shown in FIG. 4. In regression-based scoring, the system uses the same processed data (the white-painted data, the black-painted data, or both can be used). In this embodiment, the CNN uses a mean squared error function as a loss function, to output a number that indicates how close the vTHD is to the ground-truth THD value.

The algorithm follows the steps of:

Preprocessing:

1. Using, for training, the same waveforms as in the classification architecture.

2. Stacking the THD values into a Y vector and normalizing them to the range [0,1].

3. Splitting the data using random generators, as in the classification network.

Outputting:

1. A normalized vTHD value.

2. In case a training audio sample is estimated, outputting an estimated error between the CNN prediction of vTHD for the sample and the true value of THD that was measured on an initial audio signal. For example, assuming the model gives a result of vTHD=0.8 (normalized), the ground-truth THD may be within the range of [0.75, 0.85].
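
The preprocessing and scoring steps listed above can be sketched in Python as follows, by way of example only. The sketch assumes min-max scaling as the normalization to [0,1] and a seeded shuffle as the random split; the function names, the split fraction, the labels and the model outputs are all hypothetical, and no CNN is trained here.

```python
import random


def normalize_to_unit_range(values):
    """Min-max scale values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize constant labels")
    return [(v - lo) / (hi - lo) for v in values]


def random_split(items, train_fraction, seed):
    """Shuffle indices with a seeded generator and split them in two."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(indices)))
    return indices[:cut], indices[cut:]


def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical THD labels (percent), one per training waveform.
thd_labels = [0.0, 2.5, 5.0, 7.5, 10.0]
y = normalize_to_unit_range(thd_labels)
train_idx, test_idx = random_split(y, train_fraction=0.8, seed=0)

# Hypothetical model outputs, used only to exercise the loss function.
predictions = [0.05, 0.25, 0.45, 0.80, 1.00]
loss = mean_squared_error(y, predictions)
print(y, loss)
```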

The accuracy of both the classification method and the regression-based method can be improved by increasing the data sampling precision, for example by using a 16-bit digitization scheme instead of the 8-bit scheme used.

Note that, mathematically, the data set looks different for the classification problem and the regression problem in terms of the Y vector: for classification, every example S_(j) has a 1D classification vector, whereas for regression, every example S_(j) has a scalar regression score.
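
The difference between the two label formats can be illustrated by the following Python sketch. It assumes that the 1D classification vector is a one-hot encoding of an integer THD class and that the regression score is the class index scaled to [0,1]; both choices, as well as the names and values used, are hypothetical.

```python
def one_hot(class_index, num_classes):
    """Return a 1D vector with a single 1 at position class_index."""
    vector = [0] * num_classes
    vector[class_index] = 1
    return vector


# Hypothetical integer THD classes (0..4) for three examples.
thd_classes = [0, 3, 4]
num_classes = 5

# Classification: one 1D vector per example.
y_classification = [one_hot(c, num_classes) for c in thd_classes]

# Regression: one scalar score per example, scaled to [0, 1].
y_regression = [c / (num_classes - 1) for c in thd_classes]

print(y_classification)
print(y_regression)
```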

Although the embodiments described herein mainly address audio processing for audio engineering suites and/or consumer grade devices, the methods and systems described herein can also be used in other applications, such as audio quality analysis, filter design or auto-self-control of filters for still-image processing or for video processing, and, mutatis mutandis, encoding and decoding techniques for data compression that are based, or partially based, on FFT analysis.

It will thus be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application except that to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.

1. A system, comprising: a memory configured to store a machine learning (ML) model; and a processor, which is configured to: obtain a set of training audio signals that are labeled with respective levels of distortion; convert the training audio signals into respective images; train the ML model to estimate the levels of the distortion based on the images; receive an input audio signal; convert the input audio signal into an image; and estimate a level of the distortion in the input audio signal, by applying the trained ML model to the image.
2. The system according to claim 1, wherein the distortion comprises a Total Harmonic Distortion (THD).
3. The system according to claim 1, wherein the processor is configured to convert a given training audio signal into a given image by setting pixel values of the given image to represent an amplitude of the given training audio signal as a function of time.
4. The system according to claim 1, wherein the respective images and the image are two-dimensional (2D).
5. The system according to claim 1, wherein the respective images and the image are of three or more dimensions.
6. The system according to claim 1, wherein the processor is configured to obtain the training audio signals by (i) receiving initial audio signals having first durations, and (ii) slicing the initial audio signals into slices having second, shorter durations, so as to produce the training audio signals.
7. The system according to claim 1, wherein the ML model comprises a convolutional neural network (CNN).
8. The system according to claim 1, wherein the ML model comprises a generative adversarial network (GAN).
9. The system according to claim 1, wherein the input audio signal is received from nonlinear audio processing circuitry.
10. The system according to claim 1, wherein the ML model classifies the distortion according to the levels of distortion that label the training audio signal.
11. The system according to claim 1, wherein the ML model estimates the level of distortion using regression.
12. The system according to claim 1, wherein the processor is further configured to control, using the estimated level of the distortion, an audio system that produces the input audio signal.
13. A system, comprising: a memory configured to store a machine learning (ML) model; and a processor, which is configured to: obtain a plurality of initial audio signals, which have first durations in a first range of durations and which are labeled with respective levels of distortion; slice the initial audio signals into slices having second durations in a second range of durations, shorter than the first durations, so as to produce a set of training audio signals; train the ML model to estimate the levels of the distortion based on the training audio signals; receive an input audio signal having a duration in the second range of durations; and estimate a level of the distortion in the input audio signal by applying the trained ML model to the input audio signal.
14. The system according to claim 13, wherein the distortion comprises a Total Harmonic Distortion (THD).
15. The system according to claim 13, wherein the processor is configured to train the ML model by (i) converting the training audio signals into respective images and (ii) training the ML model to estimate the levels of the distortion based on the images.
16. The system according to claim 15, wherein the processor is configured to estimate the level of the distortion in the input audio signal by (i) converting the input audio signal into an image and (ii) applying the trained ML model to the image.
17. The system according to claim 15, wherein the respective images are two-dimensional (2D) images.
18. The system according to claim 13, wherein the respective images are of three or more dimensions.
19. The system according to claim 13, wherein the ML model comprises a convolutional neural network (CNN).
20. The system according to claim 13, wherein the ML model comprises a generative adversarial network (GAN).
21. The system according to claim 13, wherein the input audio signal is received from nonlinear audio processing circuitry.
22. The system according to claim 13, wherein the ML model classifies the distortion according to the levels of distortion that label the training audio signal.
23. The system according to claim 13, wherein the ML model estimates the level of distortion using regression.
24. The system according to claim 13, wherein the processor is further configured to control, using the estimated level of the distortion, an audio system that produces the input audio signal.
25. A method, comprising: obtaining a set of training audio signals that are labeled with respective levels of distortion; converting the training audio signals into respective images; training a machine learning (ML) model to estimate the levels of the distortion based on the images; receiving an input audio signal; converting the input audio signal into an image; and estimating a level of the distortion in the input audio signal, by applying the trained ML model to the image.
26. The method according to claim 25, wherein the distortion comprises a Total Harmonic Distortion (THD).
27. The method according to claim 25, wherein converting a given training audio signal into a given image comprises setting pixel values of the given image to represent an amplitude of the given training audio signal as a function of time.
28. The method according to claim 25, wherein the respective images and the image are two-dimensional (2D).
29. The method according to claim 25, wherein obtaining the training audio signals comprises (i) receiving initial audio signals having first durations, and (ii) slicing the initial audio signals into slices having second, shorter durations, so as to produce the training audio signals.
30. The method according to claim 25, wherein the ML model comprises a convolutional neural network (CNN).
31. The method according to claim 25, wherein the ML model comprises a generative adversarial network (GAN).
32. The method according to claim 25, wherein receiving the input audio signal comprises receiving the input audio signal from nonlinear audio processing circuitry.
33. The method according to claim 25, wherein the ML model classifies the distortion according to the levels of distortion that label the training audio signal.
34. The method according to claim 25, wherein the ML model estimates the level of distortion using regression.
35. The method according to claim 25, and comprising controlling, using the estimated level of the distortion, an audio system that produces the input audio signal.
36. A method, comprising: obtaining a plurality of initial audio signals, which have first durations in a first range of durations and which are labeled with respective levels of distortion; slicing the initial audio signals into slices having second durations in a second range of durations, shorter than the first durations, so as to produce a set of training audio signals; training a machine learning (ML) model to estimate the levels of the distortion based on the training audio signals; receiving an input audio signal having a duration in the second range of durations; and estimating a level of the distortion in the input audio signal by applying the trained ML model to the input audio signal.
37. The method according to claim 36, wherein the distortion comprises a Total Harmonic Distortion (THD).
38. The method according to claim 36, wherein training the ML model comprises (i) converting the training audio signals into respective images and (ii) training the ML model to estimate the levels of the distortion based on the images.
39. The method according to claim 38, wherein estimating the level of the distortion in the input audio signal comprises (i) converting the input audio signal into an image and (ii) applying the trained ML model to the image.
40. The method according to claim 38, wherein the respective images are two-dimensional (2D) images.
41. The method according to claim 36, wherein the ML model comprises a convolutional neural network (CNN).
42. The method according to claim 36, wherein the ML model comprises a generative adversarial network (GAN).
43. The method according to claim 36, wherein the input audio signal is received from nonlinear audio processing circuitry.
44. The method according to claim 36, wherein the ML model classifies the distortion according to the levels of distortion that label the training audio signal.
45. The method according to claim 36, wherein the ML model estimates the level of distortion using regression.
46. The method according to claim 36, and comprising controlling, using the estimated level of the distortion, an audio system that produces the input audio signal.